Overview
In this assessment we aim to use the MACCDC conn data to perform data analysis and modelling. We first must import the data.
mydata <- read.csv("MAC.csv")
mydata <- data.frame(mydata)
mydata
We first want to look for missing data. Service, duration, orig_bytes, resp_bytes and local_orig all seem to have missing data in them so we will see what percentage.
mtab0=data.frame(
missingduration=is.na(mydata[,"duration"]),
proto=mydata[,"proto"])
mtab0=table(mtab0)
(apply(mtab0,2,function(x)x/sum(x)))
proto
missingduration icmp tcp udp
FALSE 0.8585338 0.1656118 0.3089144
TRUE 0.1414662 0.8343882 0.6910856
mtab1=data.frame(
missing_orig_bytes=is.na(mydata[,"orig_bytes"]),
proto=mydata[,"proto"])
mtab1=table(mtab1)
(apply(mtab1,2,function(x)x/sum(x)))
proto
missing_orig_bytes icmp tcp udp
FALSE 0.8585338 0.1656118 0.3089144
TRUE 0.1414662 0.8343882 0.6910856
mtab2=data.frame(
missing_resp_bytes=is.na(mydata[,"resp_bytes"]),
proto=mydata[,"proto"])
mtab2=table(mtab2)
(apply(mtab2,2,function(x)x/sum(x)))
proto
missing_resp_bytes icmp tcp udp
FALSE 0.8585338 0.1656118 0.3089144
TRUE 0.1414662 0.8343882 0.6910856
mtab3=data.frame(
missing_local_orig=is.na(mydata[,"local_orig"]),
proto=mydata[,"proto"])
mtab3=table(mtab3)
(apply(mtab3,2,function(x)x/sum(x)))
icmp tcp udp
1 1 1
Thus we are missing the local_orig feature for every data point in the data set. We may then consider dropping this entire column as it serves no use to us and we cannot impute the data without prior knowledge of the data set and what it should look like. The duration, orig_bytes and resp_bytes all appear to be missing exactly the same data - on further analysis, we see that whenever one is missing, all three are missing.
Some initial data cleansing will come from removing the X column and the ts column. The X column is produced by the sampling and since we have a random sample of the data, the ts provides no real information on the data.
unique_uid <- mydata[!duplicated(mydata[,c('uid')]),]
unique_uid
Thus all our uid’s are unique and therefore wont provide us with any extra information either since they will be uncorrelated with the rest of the data. This is the only column with this trait, and all other columns have values which occur more than once so we can drop the uid column too.
drop_columns <- c("X","ts","local_orig","uid")
mydata <- mydata[, !names(mydata) %in% drop_columns]
head(mydata)
So we have removed the columns that didn’t provide us with any extra information. We will now extract the data we will use for DBSCAN to create clusters. The following code is pulled from Alex’s workbook and allows us to pull out 7 of the features to use for DBSCAN and ensures all elements are numeric.
miss.me <- vector(length = nrow(mydata))
miss.me <- rep(0, times = nrow(mydata))
for(i in 1:nrow(mydata)) {
if(is.na(mydata$duration[i])) { miss.me[i] <- 1 }
}
str(mydata)
'data.frame': 226943 obs. of 16 variables:
$ id.orig_h : chr "192.168.202.110" "192.168.202.83" "192.168.202.110" "192.168.202.138" ...
$ id.orig_p : int 50427 46442 12662 35203 63958 39436 52132 41741 40224 63022 ...
$ id.resp_h : chr "192.168.22.102" "192.168.206.44" "192.168.22.1" "192.168.24.187" ...
$ id.resp_p : int 15457 42 17800 10629 56587 33384 32776 7476 1688 2010 ...
$ proto : chr "tcp" "tcp" "tcp" "tcp" ...
$ service : chr "-" "-" "-" "-" ...
$ duration : num NA NA NA NA NA NA NA NA NA NA ...
$ orig_bytes : int NA NA NA NA NA NA NA NA NA NA ...
$ resp_bytes : int NA NA NA NA NA NA NA NA NA NA ...
$ conn_state : chr "REJ" "REJ" "S0" "S0" ...
$ missed_bytes : int 0 0 0 0 0 0 0 0 0 0 ...
$ history : chr "Sr" "Sr" "S" "S" ...
$ orig_pkts : int 1 1 1 1 1 1 1 1 1 1 ...
$ orig_ip_bytes: int 48 60 48 44 48 44 60 60 60 44 ...
$ resp_pkts : int 1 1 0 0 0 1 1 0 1 0 ...
$ resp_ip_bytes: int 40 40 0 0 0 40 40 0 40 0 ...
mydata.good <- as.data.frame(cbind(id.orig_p = mydata$id.orig_p, id.resp_p = mydata$id.resp_p,
orig_pkts = mydata$orig_pkts, orig_ip_bytes = mydata$orig_ip_bytes,
resp_pkts = mydata$resp_pkts, resp_ip_bytes = mydata$resp_ip_bytes))
mydata.good<- cbind(mydata.good, miss.me)
head(mydata.good)
str(mydata.good) # Should be only ints and nums
'data.frame': 226943 obs. of 7 variables:
$ id.orig_p : int 50427 46442 12662 35203 63958 39436 52132 41741 40224 63022 ...
$ id.resp_p : int 15457 42 17800 10629 56587 33384 32776 7476 1688 2010 ...
$ orig_pkts : int 1 1 1 1 1 1 1 1 1 1 ...
$ orig_ip_bytes: int 48 60 48 44 48 44 60 60 60 44 ...
$ resp_pkts : int 1 1 0 0 0 1 1 0 1 0 ...
$ resp_ip_bytes: int 40 40 0 0 0 40 40 0 40 0 ...
$ miss.me : num 1 1 1 1 1 1 1 1 1 1 ...
for(i in 1:ncol(mydata.good)) { mydata.good[,i] <- as.numeric(mydata.good[,i]) }
str(mydata.good) ## All should be nums now
'data.frame': 226943 obs. of 7 variables:
$ id.orig_p : num 50427 46442 12662 35203 63958 ...
$ id.resp_p : num 15457 42 17800 10629 56587 ...
$ orig_pkts : num 1 1 1 1 1 1 1 1 1 1 ...
$ orig_ip_bytes: num 48 60 48 44 48 44 60 60 60 44 ...
$ resp_pkts : num 1 1 0 0 0 1 1 0 1 0 ...
$ resp_ip_bytes: num 40 40 0 0 0 40 40 0 40 0 ...
$ miss.me : num 1 1 1 1 1 1 1 1 1 1 ...
# sum(mydata.good$miss.me)/nrow(mydata.good) ## 82.7% missing
I dont want to drop any data that may be important so I’ll also use the protocol, connection state and history features in my analysis.
proto <- as.factor(c(mydata$proto))
proto <- unclass(proto)
conn_state <- as.factor(c(mydata$conn_state))
conn_state <- unclass(conn_state)
history <- as.factor(c(mydata$history))
history <- unclass(history)
mydata.good$proto <- proto
mydata.good$conn_state <- conn_state
mydata.good$history <- history
mydata.good
## We'll do 10-fold CV and then apply DBSCAN, training on 90%
dg <- mydata.good
ran <- sample(1:nrow(dg), 0.9 * nrow(dg))
nor <-function(x) { (x -min(x))/(max(x)-min(x)) }
dg_norm <- as.data.frame(lapply(dg, nor))
# head(dg_norm)
dg_train <- dg_norm[ran,] ## extract training set
dg_test <- dg_norm[-ran,] ## extract testing set
dg_target_cat <- dg[ran, ncol(dg)]
dg_test_cat <- dg[-ran, ncol(dg)]
SVD
Now we can look at running DBSCAN on our data. We first need to perform PCA to figure out how many principle components to use in DBSCAN.
library("dbscan")
dg_train.svd <- svd(dg_train)
plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")

plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")

Plotting with a log y-axis and a normal y-axis give strikingly different results. The first eigenvector explains most of the variance and the 5 after that seeming explain almost the same amount of variance between them as the first one. I don’t think using just the first eigenvalue would provide that much insight and therefore will use 6 of them (this is still reducing dimensionality by almost half).
npcs = 6
We now plot the PCA to visualise the clusters formed here. We’re not plotting according to any categorical data i.e. normal vs non-normal so we may not get that much information from this.
i=1;j=2
plot(dg_train.svd$u[,i],
dg_train.svd$u[,j],type="p",
col="#33333311",pch=16,cex=1)

Finding Parameters for DBSCAN
Eps specifies how close the points should be to each other to form a cluster. If the distance is less than eps, they are considered neighbours. We find this number by finding the ‘knee’ in the plot below. I have chosen to use 7 neighbours here.
test=kNNdist(dg_train.svd$u[,1:npcs], k = 7,all=TRUE)
testmin=apply(test,1,min)
plot(sort(testmin[testmin>1e-8]),log="y")
threshholds= c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.0001, col="red")

So we choose h=0.0001 as our limit since this allows us to capture most of the information here. We also need to define our minimum number of points to form a cluster. The recommendation is to use minPts = 2*dim for large data sets to ensure we find significant clusters and avoid noise so this is what we shall choose.
##DBSCAN
Now we finally perform DBSCAN.
minPts = c(20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, 200, 225, 250)
clustercounts = c()
for(val in minPts) {
dbscanres = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = val)
clustercounts[val] <- (length(unique(dbscanres$cluster)))
}
clustercounts
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 209 NA
[22] NA NA NA 143 NA NA NA NA 157 NA NA NA NA 136 NA NA NA NA 110 NA NA
[43] NA NA 99 NA NA NA NA 111 NA NA NA NA NA NA NA NA NA NA NA NA NA
[64] NA NA NA NA NA NA NA NA NA NA NA 98 NA NA NA NA NA NA NA NA NA
[85] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 85 NA NA NA NA NA
[106] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 40 NA
[127] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[148] NA NA 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[169] NA NA NA NA NA NA 17 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[190] NA NA NA NA NA NA NA NA NA NA 22 NA NA NA NA NA NA NA NA NA NA
[211] NA NA NA NA NA NA NA NA NA NA NA NA NA NA 21 NA NA NA NA NA NA
[232] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 24
Thus we seem to hit some limiting value at 175 since a minPts of 200 has a higher amount of clusters than a minPts of 175. Since we want ou data to be interpretable we may look at reducing the amount of clusters and therefore the noise. Thus we will use an initial minPts of 175.
The output we get from this isn’t very insightful though and leaves a lot to be desired.
dbscan175 = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = 175)
dbscan50 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 50)
dbscan30 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 30)
library(cluster)
# trying to calculate the silhouette score of this clustering to see if its valid or not - currently reports Error: Vector memory exhausted (limit reached?) - I've tried looking into work arounds but cant get anything working so I'll leave this for now.
#ss <- silhouette(dbscanres$cluster, dist(dg_train.svd$u))
Plotting resulting clusters
for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
par(new=TRUE)
}
}

for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
par(new=TRUE)
}
}

for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
par(new=TRUE)
}
}

for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
}
}















for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
}
}















for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
}
}















jpeg("Assessment2 DBSCAN Clustering.jpg")
for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab="",
ylab="",
col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
par(new=TRUE)
}
}
jpeg("Assessment2 DBSCAN Clustering Eigenvector split.jpg")
op <- par(mfrow=c(5,3))
for (k in 1:5){
a = seq(k+1,6)
for (l in a){
if(k==l){next}
plot(dg_train.svd$u[,k],
dg_train.svd$u[,l],xlab=k,
ylab=l,
col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
}
}
par(op)
dev.off()
null device
1
dg_train.clustered <- data.frame(dg_train)
dg_train.clustered$cluster <- dbscan175$cluster
dg_train.clustered
M matrix
Now we look at the M matrix produced by Alex that is a sparse matrix showing connections between origin IP’s and response IP’s.
library(Matrix)
library(irlba)
library(Rtsne)
So1 <- tapply(mydata$id.orig_h, mydata$id.orig_h)
De1 <- tapply(mydata$id.resp_h, mydata$id.resp_h)
Est <- as.matrix(cbind(So1, De1))
M<- sparseMatrix(i=Est[,1], j=Est[,2])
M.svd = svd(M)
plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")

plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")

From the log axis it looks like we need the first ~100 eigenvalues but using the normal plot, it looks like we an get away with using ~30.
npcsM = 30
testM=kNNdist(M.svd$u[,1:npcsM], k = 7,all=TRUE)
testminM=apply(testM,1,min)
plot(sort(testminM[testminM>1e-15]),log="y")
threshholds= c(0.1,0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.5, col="red")

It looks like we want eps = 0.5 although we dont seem to get a knee in the data so it’s hard to pinpoint this.
MminPts = c(20, 50, 200)
clustercountsM = c()
for(val in MminPts) {
dbscanresM = dbscan(dg_train.svd$u[,1:6],eps = 0.5,minPts = val)
clustercountsM[val] <- (length(unique(dbscanresM$cluster)))
}
References:
Data from SecRepo
Converting categorical variables
Adding columns to data frames
Finding Unique Values
DBSCAN on flowers
Saving Plots
DBSCAN Parameter Estimation
Finding the knee in kNNDist
Silhouette Score introduction
Error with silhouette score
Silhouette Function
---
title: "Assessment 2 - Matt"
output:
  html_notebook: default
---
# Overview
In this assessment we aim to use the MACCDC conn data to perform data analysis and modelling. We first must import the data.

```{r}
mydata <- read.csv("MAC.csv")
mydata <- data.frame(mydata)
```

```{r}
mydata
```
We first want to look for missing data. Service, duration, orig_bytes, resp_bytes and local_orig all seem to have missing data in them so we will see what percentage.

```{r}
mtab0=data.frame(
    missingduration=is.na(mydata[,"duration"]),
    proto=mydata[,"proto"])
mtab0=table(mtab0)
(apply(mtab0,2,function(x)x/sum(x)))

mtab1=data.frame(
    missing_orig_bytes=is.na(mydata[,"orig_bytes"]),
    proto=mydata[,"proto"])
mtab1=table(mtab1)
(apply(mtab1,2,function(x)x/sum(x)))

mtab2=data.frame(
    missing_resp_bytes=is.na(mydata[,"resp_bytes"]),
    proto=mydata[,"proto"])
mtab2=table(mtab2)
(apply(mtab2,2,function(x)x/sum(x)))

mtab3=data.frame(
    missing_local_orig=is.na(mydata[,"local_orig"]),
    proto=mydata[,"proto"])
mtab3=table(mtab3)
(apply(mtab3,2,function(x)x/sum(x)))
```
Thus we are missing the local_orig feature for every data point in the data set. We may then consider dropping this entire column as it serves no use to us and we cannot impute the data without prior knowledge of the data set and what it should look like. The duration, orig_bytes and resp_bytes all appear to be missing exactly the same data - on further analysis, we see that whenever one is missing, all three are missing. 

Some initial data cleansing will come from removing the X column and the ts column. The X column is produced by the sampling and since we have a random sample of the data, the ts provides no real information on the data.

```{r}
unique_uid <- mydata[!duplicated(mydata[,c('uid')]),]
unique_uid
```
Thus all our uid's are unique and therefore wont provide us with any extra information either since they will be uncorrelated with the rest of the data. This is the only column with this trait, and all other columns have values which occur more than once so we can drop the uid column too.

```{r}
drop_columns <- c("X","ts","local_orig","uid")
mydata <- mydata[, !names(mydata) %in% drop_columns]
```

```{r}
head(mydata)
```

So we have removed the columns that didn't provide us with any extra information. We will now extract the data we will use for DBSCAN to create clusters. The following code is pulled from Alex's workbook and allows us to pull out 7 of the features to use for DBSCAN and ensures all elements are numeric.

```{r}
miss.me <- vector(length = nrow(mydata))
miss.me <- rep(0, times = nrow(mydata))
for(i in 1:nrow(mydata)) {
	if(is.na(mydata$duration[i])) { miss.me[i] <- 1 }
	}
str(mydata)
mydata.good <- as.data.frame(cbind(id.orig_p = mydata$id.orig_p, id.resp_p = mydata$id.resp_p, 
orig_pkts = mydata$orig_pkts, orig_ip_bytes = mydata$orig_ip_bytes, 
resp_pkts = mydata$resp_pkts, resp_ip_bytes = mydata$resp_ip_bytes))
mydata.good<- cbind(mydata.good, miss.me)
head(mydata.good)
str(mydata.good) # Should be only ints and nums

for(i in 1:ncol(mydata.good)) { mydata.good[,i] <- as.numeric(mydata.good[,i]) }
str(mydata.good)		## All should be nums now
# sum(mydata.good$miss.me)/nrow(mydata.good) ## 82.7% missing

```

I dont want to drop any data that may be important so I'll also use the protocol, connection state and history features in my analysis.
```{r}
proto <- as.factor(c(mydata$proto))
proto <- unclass(proto)

conn_state <- as.factor(c(mydata$conn_state))
conn_state <- unclass(conn_state)

history <- as.factor(c(mydata$history))
history <- unclass(history)

mydata.good$proto <- proto
mydata.good$conn_state <- conn_state
mydata.good$history <- history

mydata.good
```

```{r}
	## We'll do 10-fold CV and then apply DBSCAN, training on 90%
dg <- mydata.good
ran <- sample(1:nrow(dg), 0.9 * nrow(dg))
nor <-function(x) { (x -min(x))/(max(x)-min(x))   }
dg_norm <- as.data.frame(lapply(dg, nor))
	# head(dg_norm)

dg_train <- dg_norm[ran,] 	## extract training set
dg_test <- dg_norm[-ran,]   	## extract testing set
dg_target_cat <- dg[ran, ncol(dg)]
dg_test_cat <- dg[-ran, ncol(dg)]
```

## SVD

Now we can look at running DBSCAN on our data. We first need to perform PCA to figure out how many principle components to use in DBSCAN.

```{r}
library("dbscan")
```

```{r}
dg_train.svd <- svd(dg_train)
```

```{r}
plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")
plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")
```
Plotting with a log y-axis and a normal y-axis give strikingly different results. The first eigenvector explains most of the variance and the 5 after that seeming explain almost the same amount of variance between them as the first one. I don't think using just the first eigenvalue would provide that much insight and therefore will use 6 of them (this is still reducing dimensionality by almost half).

```{r}
npcs = 6
```

We now plot the PCA to visualise the clusters formed here. We're not plotting according to any categorical data i.e. normal vs non-normal so we may not get that much information from this.

```{r}
i=1;j=2
plot(dg_train.svd$u[,i],
     dg_train.svd$u[,j],type="p",
     col="#33333311",pch=16,cex=1)
```

## Finding Parameters for DBSCAN

Eps specifies how close the points should be to each other to form a cluster. If the distance is less than eps, they are considered neighbours. We find this number by finding the 'knee' in the plot below. I have chosen to use 7 neighbours here.

```{r}
test=kNNdist(dg_train.svd$u[,1:npcs], k = 7,all=TRUE)
testmin=apply(test,1,min)
```

```{r}
plot(sort(testmin[testmin>1e-8]),log="y")
threshholds= c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.0001, col="red")
```

So we choose h=0.0001 as our limit since this allows us to capture most of the information here. We also need to define our minimum number of points to form a cluster. The recommendation is to use minPts = 2*dim for large data sets to ensure we find significant clusters and avoid noise so this is what we shall choose.

##DBSCAN

Now we finally perform DBSCAN.

```{r}
minPts = c(20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, 200, 225, 250)
clustercounts = c()

for(val in minPts) {
  dbscanres = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = val)
  clustercounts[val] <- (length(unique(dbscanres$cluster)))
}
```

```{r}
clustercounts
```

Thus we seem to hit some limiting value at 175 since a minPts of 200 has a higher amount of clusters than a minPts of 175. Since we want ou data to be interpretable we may look at reducing the amount of clusters and therefore the noise. Thus we will use an initial minPts of 175.

The output we get from this isn't very insightful though and leaves a lot to be desired. 

```{r}
dbscan175 = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = 175)
dbscan50 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 50)
dbscan30 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 30)
```

```{r}
library(cluster)
```

```{r}
# trying to calculate the silhouette score of this clustering to see if its valid or not - currently reports Error: Vector memory exhausted (limit reached?) - I've tried looking into work arounds but cant get anything working so I'll leave this for now.
#ss <- silhouette(dbscanres$cluster, dist(dg_train.svd$u))
```

## Plotting resulting clusters

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    }
}
```


```{r}
jpeg("Assessment2 DBSCAN Clustering.jpg")

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
dev.off()
```

```{r}
jpeg("Assessment2 DBSCAN Clustering Eigenvector split.jpg")
op <- par(mfrow=c(5,3))
for (k in 1:5){
  a = seq(k+1,6)
  for (l in a){
      if(k==l){next}
      plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab=k,
            ylab=l,
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    }
}
par(op)
dev.off()
```

```{r}
dg_train.clustered <- data.frame(dg_train)

dg_train.clustered$cluster <- dbscan175$cluster

dg_train.clustered
```

## M matrix

Now we look at the M matrix produced by Alex that is a sparse matrix showing connections between origin IP's and response IP's.

```{r}
library(Matrix)
library(irlba)
library(Rtsne)
```


```{r}
So1 <- tapply(mydata$id.orig_h, mydata$id.orig_h)
De1 <- tapply(mydata$id.resp_h, mydata$id.resp_h)
Est <- as.matrix(cbind(So1, De1))
M<- sparseMatrix(i=Est[,1], j=Est[,2])
```

```{r}
M.svd = svd(M)
```

```{r}
plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")
plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")
```

From the log axis it looks like we need the first ~100 eigenvalues but using the normal plot, it looks like we an get away with using ~30.

```{r}
npcsM = 30
```

```{r}
testM=kNNdist(M.svd$u[,1:npcsM], k = 7,all=TRUE)
testminM=apply(testM,1,min)
```

```{r}
plot(sort(testminM[testminM>1e-15]),log="y")
threshholds= c(0.1,0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.5, col="red")
```

It looks like we want eps = 0.5 although we dont seem to get a knee in the data so it's hard to pinpoint this.

```{r}
MminPts = c(200)
clustercountsM = c()

for(val in MminPts) {
  dbscanresM = dbscan(dg_train.svd$u[,1:6],eps = 0.5,minPts = val)
  clustercountsM[val] <- (length(unique(dbscanresM$cluster)))
}
```

```{r}
clustercountsM
```

```{r}
dbscan30 = dbscan(N.svd$u[,1:npcsM],eps=0.5,minPts = 30)
```


References:

1. [Data from SecRepo](https://www.secrepo.com)

2. [Converting categorical variables](https://stackoverflow.com/questions/47922184/convert-categorical-variables-to-numeric-in-r/47923178)

3. [Adding columns to data frames](https://discuss.analyticsvidhya.com/t/how-to-add-a-column-to-a-data-frame-in-r/3278)

4. [Finding Unique Values](https://stackoverflow.com/questions/41906878/r-number-of-unique-values-in-a-column-of-data-frame)

5. [DBSCAN on flowers](https://www.geeksforgeeks.org/dbscan-clustering-in-r-programming/)

6. [Saving Plots](http://www.sthda.com/english/wiki/creating-and-saving-graphs-r-base-graphs)

7. [DBSCAN Parameter Estimation](https://en.wikipedia.org/wiki/DBSCAN#Parameter_estimation)

8. [Finding the knee in kNNDist](https://www.rdocumentation.org/packages/dbscan/versions/1.1-5/topics/kNNdist)

9. [Silhouette Score introduction](https://medium.com/codesmart/r-series-k-means-clustering-silhouette-794774b46586)

10. [Error with silhouette score](https://stackoverflow.com/questions/51248293/error-vector-memory-exhausted-limit-reached-r-3-5-0-macos)

11. [Silhouette Function](https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/silhouette)